Discriminative Training: Learning to Describe Video with Sentences, from Video Described with Sentences
Abstract
We present a method for learning word meanings from complex and realistic video clips by discriminatively training (DT) positive sentential labels against negative ones, and then use the trained word models to generate sentential descriptions for new video. This new work is inspired by recent work which adopts a maximum likelihood (ML) framework to address the same problem using only positive sentential labels. The new method, like the ML-based one, is able to automatically determine which words in the sentence correspond to which concepts in the video (i.e., ground words to meanings) in a weakly supervised fashion. While both DT and ML yield comparable results with sufficient training data, DT outperforms ML significantly with smaller training sets because it can exploit negative training labels to better constrain the learning problem.
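To make the contrast concrete, here is a rough sketch in our own notation (none of the symbols below come from the paper itself): let v_i denote a training video clip, s_i its positive sentential label, N_i a set of negative sentences for that clip, and \theta the word-model parameters. The two regimes can then be written as

\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_i \log P(s_i \mid v_i; \theta)

\theta_{\mathrm{DT}} = \arg\max_{\theta} \sum_i \log \frac{P(s_i \mid v_i; \theta)}{\sum_{s \in \{s_i\} \cup N_i} P(s \mid v_i; \theta)}

On this reading, ML only raises the probability of the observed sentence, whereas DT also pushes down the competing negatives, which is one way to see why negative labels can better constrain the learning problem when positive training data is scarce.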
Similar resources
Learning to Describe Video with Weak Supervision by Exploiting Negative Sentential Information
Most previous work on video description trains individual parts of speech independently. It is more appealing, from a linguistic point of view, for word models for all parts of speech to be learned simultaneously from whole sentences, a hypothesis some linguists have suggested for child language acquisition. In this paper, we learn to describe video by discriminatively training positive sentential...
Grounded Language Learning from Video Described with Sentences
We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels a...
Unsupervised Alignment of Natural Language with Video
Today we encounter large amounts of video data, often accompanied by text descriptions (e.g., cooking videos and recipes, videos of wet-lab experiments and protocols, movies and scripts). Extracting meaningful information from these multimodal sequences requires aligning the video frames with the corresponding sentences in the text. Previous methods for connecting language and videos relied on...
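As a toy illustration of the kind of computation involved (this sketch is ours, not the method of that paper; the function name and similarity matrix are invented for the example), a monotone alignment of frames to sentences can be found with dynamic programming:

# Illustrative sketch only: monotone alignment of video frames to sentences
# via dynamic programming, assuming a per-pair similarity sim[i][j] is
# already available. All names here are hypothetical.

def align_frames_to_sentences(sim):
    """sim[i][j]: similarity of frame i to sentence j. Returns one sentence
    index per frame, constrained to be monotonically non-decreasing."""
    n, m = len(sim), len(sim[0])
    NEG = float("-inf")
    # score[i][j]: best total similarity with frame i assigned to sentence j
    score = [[NEG] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    score[0][0] = sim[0][0]  # alignments must start at the first sentence
    for i in range(1, n):
        for j in range(m):
            stay = score[i - 1][j]                        # same sentence
            step = score[i - 1][j - 1] if j > 0 else NEG  # next sentence
            best, back[i][j] = (stay, j) if stay >= step else (step, j - 1)
            score[i][j] = best + sim[i][j]
    path = [m - 1]  # alignments must end at the last sentence
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy usage: five frames, two sentences.
sim = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.5], [0.1, 0.9], [0.2, 0.8]]
print(align_frames_to_sentences(sim))  # -> [0, 0, 1, 1, 1]

Real systems replace the toy similarity matrix with learned cross-modal scores; the monotonicity constraint is what makes the alignment computable efficiently.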
Evaluation of Midwifery Student's Attitude, Performance and Satisfaction from teaching clinical skills with the Video in Hamedan School of Nursing and Midwifery (2019)
1. Duncan I, Yarwood-Ross L, Haigh C. YouTube as a source of clinical skills education. Nurse Education Today. 2013; 33(12): 1576–1580. 2. Arguel A, Jamet E. Using video and static pictures to improve learning of procedural contents. Comput. Hum. Behav. 2008; 25(2): 354–359. 3. Johnson N, List-Ivankovic J, Eboh W, Ireland, Adams D, Mowatt E, Martindale S. Research and evidence based pra...
See, Hear, and Read: Deep Aligned Representations
We capitalize on large amounts of readily available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound and language. By leveraging over a year of sound from video and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Our experiments suggest that th...
Journal title: CoRR
Volume: abs/1306.5263
Issue: -
Pages: -
Publication date: 2013